Things to do before fitting a growth model

Author

Frederick Anyan

Published

September 13, 2023

Modified

October 8, 2024

Some practical preliminary steps are useful before performing longitudinal data analyses. They may even show early on whether the data will support your hypothesized growth function or whether a different growth function should be fitted. Here, I go through some of the practical things to do before performing longitudinal data analyses (using a wide or long data set).

It is important to consider whether the measured variables meet different measurement assumptions, including:
  • Reliability of the measurement instrument over repeated observations. Consistently good reliability at every occasion suggests longitudinal reliability of the instrument. Cronbach’s \(α\) can be calculated at each time point to show the reliability of the measurement instrument. This, however, does not mean that there is reliable change within individuals over time.

  • Related to reliability of the measurement instrument is ensuring that the changes observed are true changes in individuals and not changes in the measurement instrument or changes in the meaning of the attribute under study over repeated observations. Measurement invariance testing can be useful here, but it can also be difficult when repeated observations span several years (ages), as the meaning of attributes/constructs may differ for people of different ages.
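
As a minimal sketch of the per-wave reliability check, Cronbach’s α can be computed from item-level responses at each time point. The data set used in this post contains only scale scores, so the item responses below are simulated, and the names `cronbach_alpha`, `simulate_wave` and `items_by_wave` are hypothetical. With real, complete item data, `psych::alpha()` should report the same coefficient as `raw_alpha`.

```r
# Cronbach's alpha from an items-in-columns data frame (complete cases only)
cronbach_alpha <- function(items) {
  items <- na.omit(as.data.frame(items))
  k <- ncol(items)
  item_var <- sapply(items, var)      # variance of each item
  total_var <- var(rowSums(items))    # variance of the sum score
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

# Simulated item responses (NOT the loneliness data): 4 items at each of 3 waves
set.seed(1)
simulate_wave <- function(n = 200, k = 4) {
  true_score <- rnorm(n)
  as.data.frame(replicate(k, true_score + rnorm(n)))
}
items_by_wave <- list(T1 = simulate_wave(), T2 = simulate_wave(), T3 = simulate_wave())

# One alpha per time point
alpha_by_wave <- sapply(items_by_wave, cronbach_alpha)
round(alpha_by_wave, 2)
```

Similar values of α across waves speak to the consistency of the instrument at each occasion, not to whether individuals change reliably over time.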

You can read more about things to do before fitting growth models in the reference below:

Load packages

Code
suppressPackageStartupMessages({
library(psych)  #For practical preliminaries: univariate and bivariate descriptive stats
library(tidyr) #For reshaping data
library(lcsm)   #For plotting longitudinal trajectories in a wide data set
library(ggplot2) #For plotting longitudinal trajectories in long data set
})
Warning: package 'psych' was built under R version 4.3.3
Code
## Read data 
data <- read.csv("/Volumes/anyan-1/frederickanyan.github.io/quantpost_data/data.csv")
#Create new data set with only your main outcome variables
lonely <- data[, c("personid", "lone1", "lone2", "lone3", "lone4", "lone5")]

1. Examine univariate and bivariate statistics

Code
#Examine descriptive statistics.
describe(lonely[, 2:6])#univariate descriptives
      vars   n mean   sd median trimmed  mad min  max range skew kurtosis   se
lone1    1 398 1.46 0.46   1.32    1.38 0.34   1 3.75  2.75 1.81     3.95 0.02
lone2    2 395 1.47 0.52   1.31    1.38 0.31   1 5.00  4.00 2.22     7.22 0.03
lone3    3 390 1.53 0.52   1.36    1.44 0.37   1 3.78  2.78 1.52     2.27 0.03
lone4    4 413 1.35 0.47   1.22    1.25 0.25   1 4.92  3.92 3.05    12.75 0.02
lone5    5 409 1.32 0.39   1.20    1.24 0.25   1 3.43  2.43 2.45     7.22 0.02

The first thing to notice from the descriptive statistics is the number and pattern of missing data: the number of observations varies across waves, from 390 at T3 to 413 at T4. The means and standard deviations also show a simple pattern: the feeling of loneliness increases from T1 through T3 and then declines through T5, and the variation likewise increases and then declines after T3.
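
A rough sketch of how the number and pattern of missing values could be tabulated in base R. The data frame `wide` is simulated here and only mimics the layout of `lonely[, 2:6]`; it is not the loneliness data.

```r
# Simulated wide data (NOT the loneliness data) with some missing waves
set.seed(2)
wide <- as.data.frame(matrix(rnorm(50 * 5, mean = 1.5, sd = 0.5), ncol = 5))
names(wide) <- paste0("lone", 1:5)
wide[sample(50, 8), "lone2"] <- NA
wide[sample(50, 5), "lone5"] <- NA

# Number of missing values per wave
n_missing <- colSums(is.na(wide))

# Missingness patterns: "1" = observed, "0" = missing, one character per wave
pattern <- apply(!is.na(wide), 1, function(obs) paste(as.integer(obs), collapse = ""))
pattern_counts <- sort(table(pattern), decreasing = TRUE)

n_missing
pattern_counts
```

A pattern such as `11110` would flag people who were observed at the first four waves and missing at the last one.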

It can already be seen from the means that a linear growth function might not accurately characterize the trajectory in the data.
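
One quick way to see this is to look at the change in the mean between consecutive waves. The sketch below uses the five means copied from the `describe()` output; the object names are only illustrative.

```r
# Wave means copied from the describe() output above
wave_means <- c(T1 = 1.46, T2 = 1.47, T3 = 1.53, T4 = 1.35, T5 = 1.32)

# Change in the mean between consecutive waves
mean_change <- round(diff(wave_means), 2)
mean_change

# A straight line implies changes that all share one sign (and are of similar size)
same_direction <- all(mean_change > 0) || all(mean_change < 0)
same_direction
```

The changes switch from positive to negative after T3, which is what a single straight line cannot reproduce.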

2. Describe covariance and correlation matrices

Code
#bivariate descriptives
cov(lonely[, 2:6], use='pairwise.complete.obs') #covariance matrix
           lone1      lone2      lone3      lone4      lone5
lone1 0.21451070 0.14910019 0.09978644 0.08198277 0.05209327
lone2 0.14910019 0.27059267 0.13796998 0.10077467 0.06904941
lone3 0.09978644 0.13796998 0.27223396 0.10690494 0.08475463
lone4 0.08198277 0.10077467 0.10690494 0.21720685 0.07542655
lone5 0.05209327 0.06904941 0.08475463 0.07542655 0.15471723

The feasibility of estimating a growth model can also be gauged in advance by examining the covariance matrix. If the covariances between adjacent time points (T1 and T2; T2 and T3; T3 and T4; T4 and T5) are higher than those between non-adjacent time points, this likely indicates non-negative slope variance. For example, in our covariance matrix the observed covariances between adjacent time points are 0.15, 0.14, 0.11 and 0.08. These are sometimes higher and sometimes lower than the covariances between non-adjacent time points (for example, 0.08 for T4 and T5 versus 0.10 for T2 and T4), so the matrix alone does not rule out a negative slope variance estimate.
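
The comparison of adjacent and non-adjacent covariances can be sketched by grouping the entries of the matrix by lag. The matrix `S` below is typed in from the `cov()` output above; the names `lag`, `cov_by_lag` and `mean_by_lag` are only illustrative.

```r
# Covariance matrix copied from the cov() output above
waves <- paste0("lone", 1:5)
S <- matrix(c(
  0.21451070, 0.14910019, 0.09978644, 0.08198277, 0.05209327,
  0.14910019, 0.27059267, 0.13796998, 0.10077467, 0.06904941,
  0.09978644, 0.13796998, 0.27223396, 0.10690494, 0.08475463,
  0.08198277, 0.10077467, 0.10690494, 0.21720685, 0.07542655,
  0.05209327, 0.06904941, 0.08475463, 0.07542655, 0.15471723
), nrow = 5, byrow = TRUE, dimnames = list(waves, waves))

# Lag = number of waves separating each pair (1 = adjacent)
lag <- abs(row(S) - col(S))
lower <- lower.tri(S)

# Individual covariances grouped by lag, and their averages
cov_by_lag <- split(round(S[lower], 2), lag[lower])
mean_by_lag <- tapply(S[lower], lag[lower], mean)

cov_by_lag
round(mean_by_lag, 3)
```

On average the covariances shrink as the lag grows, but the smallest adjacent covariance (T4 and T5) is still below the largest lag-2 covariance (T2 and T4), which is the mixed picture described above.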

Code
cor(lonely[, 2:6], use='pairwise.complete.obs') #correlation matrix
          lone1     lone2     lone3     lone4     lone5
lone1 1.0000000 0.6153198 0.4666493 0.3902312 0.3027420
lone2 0.6153198 1.0000000 0.5184437 0.4142563 0.3557064
lone3 0.4666493 0.5184437 1.0000000 0.4944602 0.4240808
lone4 0.3902312 0.4142563 0.4944602 1.0000000 0.4342628
lone5 0.3027420 0.3557064 0.4240808 0.4342628 1.0000000

The correlations over time provide unique information for longitudinal analysis. Here, the correlations range from about 0.30 (T1 and T5) to 0.62 (T1 and T2) and weaken as the time between observations grows, indicating that the stability of individual differences across time is modest to high.
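
Because `use='pairwise.complete.obs'` computes each entry from the people observed at both waves, different entries can rest on different sample sizes. A minimal sketch of how those pairwise counts could be obtained in base R, using simulated data rather than the loneliness data:

```r
# Simulated wide data (NOT the loneliness data) with missing values
set.seed(3)
wide <- as.data.frame(matrix(rnorm(60 * 5), ncol = 5))
names(wide) <- paste0("lone", 1:5)
wide[1:10, "lone1"] <- NA
wide[6:20, "lone3"] <- NA

# Observed-data indicator: 1 = observed, 0 = missing
observed <- ifelse(is.na(as.matrix(wide)), 0, 1)

# The cross-product counts, for every pair of waves, the people observed at both
pairwise_n <- crossprod(observed)
pairwise_n
```

The diagonal holds the number of observations at each wave, and the off-diagonal cells hold the number of people contributing to each covariance or correlation.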

3. Supplement main analysis with bivariate scatter plots and correlations

Code
#bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.
pairs.panels(lonely[, 2:6], lm = TRUE) #lm = TRUE to fit a regression line

Bivariate scatter plots and correlations, along with histograms, can supplement the main analysis.

4. Examine longitudinal plots

Code
plot_trajectories(data = lonely,
                  id_var = "personid", 
                  var_list = c("lone1", "lone2", "lone3", "lone4", "lone5"),
                  xlab = "Year", ylab = "Loneliness",
                  connect_missing = FALSE,   #Do not connect lines across missing observations
                  random_sample_frac = 0.05, #You can select more or less than 5% of the data by adjusting this
                  title_n = TRUE)
Warning: Removed 9 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_point()`).

Make longitudinal plots that show participants’ loneliness scores on the y-axis and the time of observation on the x-axis. Here, you can make one longitudinal plot that visualizes the overall trajectory for all participants and one that visualizes the trajectory for a subset of the participants.

5. Examine separate individual longitudinal plots

Code
plot_trajectories(data = lonely,
                  id_var = "personid", 
                  var_list = c("lone1", "lone2", "lone3", "lone4", "lone5"),
                  xlab = "Year", ylab = "Loneliness",
                  connect_missing = FALSE,   #Do not connect lines across missing observations
                  random_sample_frac = 0.025, #Here 2.5% of the data; adjust this to select more or less
                  title_n = TRUE) +
 facet_wrap(~personid)
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_line()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).

My personal choice has been to show separate individual longitudinal plots, but you can decide to show whichever one works for you.

Code
## Reshape from wide to long using tidyr
lonelylong <- lonely %>%
              pivot_longer(cols = 2:6,
                           names_to = "loneyearly",
                           values_to = "value")
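
After reshaping, `loneyearly` holds labels such as "lone1" rather than numbers. A small sketch, assuming the wave number is the digit at the end of each label, of how a numeric time variable could be derived for plotting or modelling. The data frame `toy_long` and the columns `wave` and `time` are hypothetical and the values are made up.

```r
# Toy long data in the same shape as lonelylong (values are made up)
toy_long <- data.frame(
  personid   = rep(1:2, each = 5),
  loneyearly = rep(paste0("lone", 1:5), times = 2),
  value      = c(1.2, 1.3, NA, 1.1, 1.0, 2.0, 2.1, 2.4, 1.9, 1.8)
)

# Strip the "lone" prefix to get the wave number as an integer
toy_long$wave <- as.integer(sub("^lone", "", toy_long$loneyearly))

# Optionally center time at the first wave so that 0 marks the first occasion
toy_long$time <- toy_long$wave - 1L

head(toy_long)
```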

1. Examine univariate and bivariate statistics

Code
#Examine descriptive statistics using the wide data set.
describe(lonely[, 2:6]) #univariate descriptives
      vars   n mean   sd median trimmed  mad min  max range skew kurtosis   se
lone1    1 398 1.46 0.46   1.32    1.38 0.34   1 3.75  2.75 1.81     3.95 0.02
lone2    2 395 1.47 0.52   1.31    1.38 0.31   1 5.00  4.00 2.22     7.22 0.03
lone3    3 390 1.53 0.52   1.36    1.44 0.37   1 3.78  2.78 1.52     2.27 0.03
lone4    4 413 1.35 0.47   1.22    1.25 0.25   1 4.92  3.92 3.05    12.75 0.02
lone5    5 409 1.32 0.39   1.20    1.24 0.25   1 3.43  2.43 2.45     7.22 0.02

The first thing to notice from the descriptive statistics is the number and pattern of missing data: the number of observations varies across waves, from 390 at T3 to 413 at T4. The means and standard deviations also show a simple pattern: the feeling of loneliness increases from T1 through T3 and then declines through T5, and the variation likewise increases and then declines after T3.

It can already be seen from the means that a linear growth function might not accurately characterize the trajectory in the data.

2. Describe covariance and correlation matrices

Code
#bivariate descriptives
cov(lonely[, 2:6], use='pairwise.complete.obs') #covariance matrix
           lone1      lone2      lone3      lone4      lone5
lone1 0.21451070 0.14910019 0.09978644 0.08198277 0.05209327
lone2 0.14910019 0.27059267 0.13796998 0.10077467 0.06904941
lone3 0.09978644 0.13796998 0.27223396 0.10690494 0.08475463
lone4 0.08198277 0.10077467 0.10690494 0.21720685 0.07542655
lone5 0.05209327 0.06904941 0.08475463 0.07542655 0.15471723

The feasibility of estimating a growth model can also be gauged in advance by examining the covariance matrix. If the covariances between adjacent time points (T1 and T2; T2 and T3; T3 and T4; T4 and T5) are higher than those between non-adjacent time points, this likely indicates non-negative slope variance. For example, in our covariance matrix the observed covariances between adjacent time points are 0.15, 0.14, 0.11 and 0.08. These are sometimes higher and sometimes lower than the covariances between non-adjacent time points (for example, 0.08 for T4 and T5 versus 0.10 for T2 and T4), so the matrix alone does not rule out a negative slope variance estimate.

Code
cor(lonely[, 2:6], use='pairwise.complete.obs') #correlation matrix
          lone1     lone2     lone3     lone4     lone5
lone1 1.0000000 0.6153198 0.4666493 0.3902312 0.3027420
lone2 0.6153198 1.0000000 0.5184437 0.4142563 0.3557064
lone3 0.4666493 0.5184437 1.0000000 0.4944602 0.4240808
lone4 0.3902312 0.4142563 0.4944602 1.0000000 0.4342628
lone5 0.3027420 0.3557064 0.4240808 0.4342628 1.0000000

The correlations over time provide unique information for longitudinal analysis. Here, the correlations range from about 0.30 (T1 and T5) to 0.62 (T1 and T2) and weaken as the time between observations grows, indicating that the stability of individual differences across time is modest to high.

3. Supplement main analysis with bivariate scatter plots and correlations

Code
#bivariate scatter plots below the diagonal, histograms on the diagonal, and the Pearson correlation above the diagonal.
pairs.panels(lonely[, 2:6], lm = TRUE) #lm = TRUE to fit a regression line

Bivariate scatter plots and correlations, along with histograms, can supplement the main analysis.

4. Examine longitudinal plots

Code
#Longitudinal plots with the long data set
ggplot(data = lonelylong[which(lonelylong$personid < 26),], #Select the first 25 participants to show
       aes(x = loneyearly, y = value, group = personid)) +
       geom_line() +
       #geom_smooth(method = lm, se = FALSE, size = 1) +
       xlab("Time of observation") +
       ylab("Loneliness")
Warning: Removed 17 rows containing missing values or values outside the scale range
(`geom_line()`).

Make longitudinal plots that show participants’ loneliness scores on the y-axis and the time of observation on the x-axis. Here, you can make one longitudinal plot that visualizes the overall trajectory for all participants and one that visualizes the trajectory for a subset of the participants.
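
For the overall trajectory, one option is to overlay the mean at each wave on the individual lines. This is only a sketch on made-up data: `toy_long`, `wave` and `wave_means` are hypothetical names, and the plotting part runs only if ggplot2 is installed.

```r
# Toy long data (values are made up); wave is a numeric time index
toy_long <- data.frame(
  personid = rep(1:3, each = 4),
  wave     = rep(1:4, times = 3),
  value    = c(1.0, 1.2, 1.4, 1.1,
               2.0, 2.2, NA,  1.9,
               1.5, 1.6, 1.8, 1.3)
)

# Mean at each wave; the formula interface drops rows with a missing value
wave_means <- aggregate(value ~ wave, data = toy_long, FUN = mean)
wave_means

# Individual lines in grey with the mean trajectory on top (needs ggplot2)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy_long, aes(x = wave, y = value)) +
    geom_line(aes(group = personid), colour = "grey60", na.rm = TRUE) +
    geom_line(data = wave_means, colour = "black", linewidth = 1) +
    xlab("Time of observation") +
    ylab("Loneliness")
  print(p)
}
```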

5. Include separate individual trajectories

Code
#Longitudinal plots with the long data set
ggplot(data = lonelylong[which(lonelylong$personid < 6),], #Select only five participants 
       aes(x = loneyearly, y = value, group = personid)) +
       geom_line() +     
       #geom_smooth(method = lm, se = FALSE, size = 1) +
       xlab("Time of observation") +
       ylab("Loneliness")+
facet_wrap(~personid)
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_line()`).

My personal choice has been to show separate individual longitudinal plots, but you can decide to show whichever one works for you.